Country & Soul


Textual Analysis

Table of Contents

  1. Introduction
    1. Add Features to Metadata
  2. Vector Space Modeling
    1. Doc-Term Matrix
    2. Time-Term Matrix
  3. Clustering
    1. Dendrogram Representation
    2. t-SNE Representation
  4. Principal Component Analysis
  5. Topic Modeling
    1. Corpus Level Topics
    2. Genre Level Topics
  6. Word Embeddings
  7. Sentiment Analysis
    1. NRC Lexicon
    2. VADER Analysis

Introduction

In this notebook, I apply several analytical techniques to the corpus of music reviews. I employ vector space models (VSM), clustering methods, Principal Component Analysis (PCA), topic models, word embeddings, and semantic analysis.

To begin, I read in the tables that I created in Digital Analytical Edition.ipynb.

Add Features to Metadata

I add some features to the LIB table to aid with visualization. I add a column that identifies the most popular artists in the corpus and another that identifies the most common publications. Finally, I add a column that records the difference, in days, between each article's publication date and the first publication in the corpus.
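The steps above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the column names (`artist`, `publication`, `pub_date`) and the cutoff of two top artists are assumptions for the toy data.

```python
import pandas as pd

# Hypothetical LIB metadata table; the real column names may differ.
LIB = pd.DataFrame({
    'artist': ['Dolly Parton', 'Prince', 'Prince',
               'James Brown', 'Dolly Parton', 'Willie Nelson'],
    'publication': ['Rolling Stone', 'Rolling Stone', 'Spin',
                    'Spin', 'Billboard', 'Rolling Stone'],
    'pub_date': pd.to_datetime(['1975-06-01', '1984-07-15', '1985-01-10',
                                '1973-03-20', '1980-11-05', '1979-02-28']),
})

# Flag the most frequently covered artists (top 2 here for brevity).
top_artists = LIB['artist'].value_counts().head(2).index
LIB['top_artist'] = LIB['artist'].where(LIB['artist'].isin(top_artists), 'other')

# Flag the most common publications the same way.
top_pubs = LIB['publication'].value_counts().head(2).index
LIB['top_pub'] = LIB['publication'].where(LIB['publication'].isin(top_pubs), 'other')

# Days elapsed between each article and the first article in the corpus.
LIB['days_since_first'] = (LIB['pub_date'] - LIB['pub_date'].min()).dt.days
```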


Vector Space Modeling

Here, I create several vector space models (VSMs). The first VSM is a straightforward document-term matrix. This representation allows me to assess the similarity of documents in the corpus.

The second VSM is a term-time matrix that measures term occurrence over time. Terms are assigned a time step based on the order of their occurrence in an article. Publication dates are used to order articles within the corpus.

Doc-Term Matrix

Analysis

After creating a document-term matrix representation of the corpus, I projected the documents onto the first five principal components. I plotted the five PCs against each other and looked for clusters by genre, artist, publication, and date.
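The pipeline described here can be sketched with scikit-learn. The four toy documents and the use of `TfidfVectorizer` are assumptions; the notebook builds its matrix from the n-gram bag-of-words tables instead.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the corpus articles (the real corpus has thousands).
docs = [
    'banjo and fiddle on a country record',
    'country singer with a steel guitar',
    'soul horns and a funky rhythm section',
    'soul vocals over a horn section',
]

# Document-term matrix weighted by TF-IDF.
dtm = TfidfVectorizer().fit_transform(docs).toarray()

# Project the documents onto the first few principal components.
pca = PCA(n_components=3)
projected = pca.fit_transform(dtm)
```

Each row of `projected` is one document's coordinates in PC space; these are the coordinates plotted against each other in the figures.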

Figure 1 shows a striking separation by genre along the second PC. Although it would be improper to attribute meaning to the PCs, it seems like the second PC measures genre-specific language and captures the opposition between country music articles and soul music articles.

Figure 2 looks at a subset of articles in the corpus. Only articles that discuss the top five soul musicians or top five country musicians are plotted along the second and third PCs. Here, there is some clustering by artist. Articles discussing Prince separate themselves along the third PC. There are also distinct clusters for Stevie Wonder, James Brown, and Aretha Franklin. Interestingly, these are all soul artists. This could suggest that language used to describe soul musicians is more distinctive than the language used for country musicians. Or, this could simply be an artifact caused by including four times as many soul articles as country articles in the corpus. In either case, the second and third PCs seem to distinguish articles by artist.

Figure 3 investigates the most common journals in the corpus. I hypothesized that the articles published by the same source would cluster together. This might suggest that different publications used unique stylistic features in their music journalism. Clearly there are no clusters or discernible patterns in the figure. This could indicate that the top 10 publications are not stylistically different, or that stylistic features are not particularly significant when examining document similarity.

Figure 4 attempts to show temporal patterns. There seems to be a gradation from older articles to newer articles along the second PC. Although the trend is fuzzy, it is pronounced enough to suggest a shift in the language of music journalism over the period of study.

Time-Term Matrix

First, I represent the corpus with a Time-Term matrix. Often, time-term matrices relate the instances of a term to its position in a document. Since the documents in this corpus are relatively short, I am more interested in word occurrence over the length of the corpus. I create two matrices. The first, CALENDAR_TIME, records n-gram occurrence over actual calendar dates. Suppose an n-gram occurs in an article at least once. Then, CALENDAR_TIME records a '1' for that n-gram on the article's publication date. If an n-gram does not occur in any articles on a given date, a '0' is recorded.

Unfortunately, the publication dates of articles in the corpus are not uniformly distributed. N-grams may display artificially high frequency in time periods with many articles. I attempt to mitigate this with the second matrix, CORPUS_TIME. This matrix is essentially the same as CALENDAR_TIME, except it does not factor in publication dates. Instead, it records n-gram occurrence based on the sequence of articles in the corpus. For example, if a given n-gram occurs in the third article in the corpus, a '1' is recorded for the n-gram at time step 3.
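Both matrices can be sketched as pandas pivots over a long-format occurrence table. The toy table and its column names (`article`, `pub_date`, `ngram`) are assumptions; the notebook derives them from the nBOW table.

```python
import pandas as pd

# Hypothetical article-level n-gram occurrences with publication dates.
occurrences = pd.DataFrame({
    'article': [1, 1, 2, 3, 3],
    'pub_date': pd.to_datetime(['1975-06-01', '1975-06-01', '1975-06-01',
                                '1980-11-05', '1980-11-05']),
    'ngram': ['banjo', 'vinyl', 'vinyl', 'cassette', 'vinyl'],
})
occurrences['hit'] = 1

# CORPUS_TIME: binary occurrence indexed by article order in the corpus.
CORPUS_TIME = occurrences.pivot_table(index='article', columns='ngram',
                                      values='hit', aggfunc='max', fill_value=0)

# CALENDAR_TIME: binary occurrence indexed by publication date; articles
# sharing a date are collapsed with max(), so a '1' means the n-gram
# appeared in at least one article that day.
CALENDAR_TIME = occurrences.pivot_table(index='pub_date', columns='ngram',
                                        values='hit', aggfunc='max', fill_value=0)
```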

I unstack the nBOW to create each matrix. nBOW contains the 20,000 most frequent n-grams in the corpus, instead of just unigram terms. This choice was made so that full artist names, like Aretha Franklin, can be compared.

I create two functions for visualizing Kernel Density Estimates (KDEs).
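The core of such a function might look like the following sketch, which returns the estimated density over a grid rather than plotting it; the function name and grid size are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def term_kde(time_steps, grid_size=200):
    """Estimate the density of a term's occurrences over time.

    time_steps: the positions (calendar dates as ordinals, or article
    indices for corpus time) at which the term occurred.
    """
    kde = gaussian_kde(time_steps)
    grid = np.linspace(min(time_steps), max(time_steps), grid_size)
    return grid, kde(grid)

# Usage: density of a term over hypothetical corpus time steps;
# two clumps of occurrences should yield a double-peaked curve.
grid, density = term_kde([3, 4, 5, 40, 41, 44, 45])
```

The notebook's versions presumably pass the result to a plotting routine; only the estimation step is shown here.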

Analysis

Fig. 5 visualizes the KDE for the significance of the term banjo over calendar time. This plot is primarily a proof of concept, but the double-peaked shape is interesting in itself. It suggests that the term banjo was most significant in the mid-1970s before seeing a resurgence in the 2000s.

Fig. 6 investigates the significance of different audio formats over corpus time. Vinyl appears somewhat consistently throughout the corpus, whereas cassette and cd exhibit a prominent rise and fall.

Fig. 7 attempts to show that this method can be used to measure an artist's career. I compared four well-known country musicians who achieved peak notoriety in different eras. The plot shows that Trisha Yearwood has a much higher peak significance within the corpus than Dolly Parton or Loretta Lynn. This runs counter to my intuition. The plot says less about the individual careers of the artists and more about the nature of celebrity over time.

Fig. 8 draws a contrast with Figure 7 by plotting the same country artists over corpus time. The plots are qualitatively very similar. This could mean that an uneven distribution of articles over time is not as detrimental as initially thought.

Fig. 9 & Fig. 10 show artists who left their bands to go solo. Fig. 9 could suggest that Diana Ross was able to extend her significance in the media by going solo. It is worth noting that this corpus is likely inappropriate for comparing the significance of The Supremes and Diana Ross in music journalism. Most would argue that Diana Ross was affiliated with Disco more than Soul and R&B in her individual career. Thus, she is not well represented in the corpus. It is easier to claim that The Supremes are classified as a soul group. Still, a more expansive corpus would certainly improve this analysis.

Fig. 10 shows that Lauryn Hill's individual significance parallels that of the Fugees. I hypothesize that journalists often refer to the Fugees when writing about Lauryn Hill as a way of contextualizing her career.


Clustering

Dendrogram Representation

I apply hierarchical agglomerative clustering at the document level to identify more patterns. As an initial step, I reduce the vocabulary since agglomerative clustering is computationally expensive. From the N-gram bag-of-words table, I select the top 10% open-category terms by mean tfidf.
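The vocabulary-reduction step can be sketched in pandas. The long-format table below and its columns (`term`, `tfidf`) are assumptions, and the part-of-speech filter that restricts the selection to open-category terms is omitted here.

```python
import pandas as pd

# Hypothetical long-format BOW: one row per (article, term) tfidf value.
BOW = pd.DataFrame({
    'term': ['banjo', 'banjo', 'soul', 'soul', 'the',
             'the', 'horn', 'vinyl', 'fiddle', 'guitar'],
    'tfidf': [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.5, 0.4, 0.3, 0.2],
})

# Rank terms by mean tfidf and keep the top 10%.
mean_tfidf = BOW.groupby('term')['tfidf'].mean().sort_values(ascending=False)
cutoff = max(1, int(len(mean_tfidf) * 0.10))
vocab = mean_tfidf.head(cutoff).index
```

With seven unique toy terms, 10% rounds down to a single surviving term, the one with the highest mean tfidf.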

Next, I apply hierarchical agglomerative clustering.
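A minimal version of this step, using scipy's hierarchical-clustering routines on a random stand-in for the reduced tfidf matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy tfidf matrix: 6 documents over 10 terms.
rng = np.random.default_rng(0)
X = rng.random((6, 10))

# Ward linkage builds the agglomerative hierarchy; in the notebook,
# dendrogram() plots it (no_plot=True here just computes the layout).
Z = linkage(X, method='ward')
tree = dendrogram(Z, no_plot=True)
```

The linkage matrix `Z` has one row per merge, so n documents produce n-1 merges.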

Clearly, there are too many documents to benefit from a dendrogram. A t-SNE representation may be more useful.

t-SNE Representation

t-SNE does an excellent job of clustering articles by artist. While this leads to a lovely visualization, it's not exactly insightful. It's likely that t-SNE recognizes artist and band names as the most discriminating terms in the corpus. This clustering surfaces information that could otherwise be learned by scraping the article titles.

Though not pictured here, I also performed clustering on articles with proper nouns removed. Articles no longer clustered by artist; in fact, there were no discernible patterns to the clusters.


Principal Component Analysis

I previously used PCA to investigate clustering patterns at the article level. In this section, I will formally construct two tables that record principal components and loadings. The first table, DCM, is a Document-Component Matrix where each row represents a document and each column is a principal component. The second table is LOADINGS which records the loading vectors for the principal components. Each row is a loading vector for a specified term.

As a first step, I create a dataframe of normalized TFIDF values. To create the TFIDF dataframe, I unstack the unigram bag-of-words. Note that this bag-of-words has stopwords removed and TFIDF values are already normalized.
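Construction of the two tables can be sketched as follows. The toy TFIDF dataframe stands in for the unstacked unigram bag-of-words, and the `PC0`/`PC1` column labels are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical normalized TFIDF dataframe (documents x terms).
TFIDF = pd.DataFrame(
    [[0.9, 0.0, 0.1, 0.0],
     [0.8, 0.1, 0.0, 0.1],
     [0.0, 0.9, 0.0, 0.1],
     [0.1, 0.8, 0.1, 0.0]],
    columns=['banjo', 'horn', 'fiddle', 'sax'])

pca = PCA(n_components=2)

# DCM: one row per document, one column per principal component.
DCM = pd.DataFrame(pca.fit_transform(TFIDF),
                   index=TFIDF.index,
                   columns=['PC0', 'PC1'])

# LOADINGS: one loading vector per term (terms x components).
LOADINGS = pd.DataFrame(pca.components_.T,
                        index=TFIDF.columns,
                        columns=['PC0', 'PC1'])
```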

Earlier in this notebook, I investigated the corpus along the first 5 principal components. I attached metadata features and attempted to identify clusters. For the sake of brevity, I do not create additional PCA plots here. Instead, I plot the loadings and hypothesize about how terms in the corpus relate to the principal components.

The loadings plots don't reveal any obvious global patterns. There are some local groupings of similar words. For instance, 'johnny', 'cash', 'willie', and 'nelson' can all be found near each other. This could mean that the loadings capture some semantic information, but it is a tough case to argue.

I write the DCM and LOADINGS tables to csv files for later use.


Topic Modeling

I perform topic modeling on the corpus to uncover latent topics. I use scikit-learn's Latent Dirichlet Allocation implementation as the topic modeling method. Two tables, THETA and PHI, capture the document-topic matrix and topic-term matrix respectively.

Corpus Level Topics

Next, I explore topics by genre, artist, and article type.

I notice that there is significant overlap in top terms for each topic. I attempt to find the most relevant, discriminating terms for each topic.

Genre Level Topics

Topics at the corpus level seem to subdivide based on genre. In this section, I perform topic modeling on each genre separately. I hope that this will uncover different topics that reveal a latent structure in pieces of music journalism.

Topic modeling at the genre level does not reveal topics that are any more interesting than those at the corpus level. For each genre, the top terms are usually artist names and descriptors of the artists' style. For instance, country topic 3 is identified by the relevant terms "kd", "langs", "singin", "liberal", "colored", and "contradiction". Clearly, this topic is about k.d. lang, and how her music might differ from the genre as a whole. (Maybe her music is "liberal" whereas the genre is conservative.)

Some topics are present at both the corpus and genre level. For example, corpus topic 8 and soul topic 8 both seem to be about Dr. John, Allen Toussaint, The Neville Brothers, and New Orleans funk.

What conclusions can be drawn from topic modeling on this corpus? Given the narrow focus of the corpus and uniformity of documents, it is unsurprising that topic modeling does not uncover wholly distinct topics. It's possible that initializing the algorithm with a smaller number of topics may find more disjoint groupings.


Word Embeddings

To create word embeddings, I use the gensim word2vec implementation.

The two t-SNE plots above use the word embeddings to cluster semantically similar terms. The first plot shows all terms in the corpus. The second plot isolates just the top 2000 terms by tfidf. Terms cluster with similar parts of speech. For instance, proper nouns cluster at the top of the image. Verbs are scattered to the bottom. Regular nouns group to the left.

Next, I test Dr. Rafael Alvarado's analogy methods. These methods, complete_analogy and get_most_similar, perform semantic algebra operations on provided terms. Given terms A, B, and C, complete_analogy returns the term D that best satisfies the analogy A:B::C:D. get_most_similar returns a list of the terms most similar to a provided term.

I test the analogy `country : banjo :: soul : ___`. The method determines that *sax*, *trombone*, and *keyboards* are the best fits. This analogy should seem sensible to anyone familiar with the genres. The analogy method tends to be less accurate when more specific terms are used.

Sentiment Analysis

I assess the sentiment of the corpus using the NRC and VADER lexicons. The NRC lexicon will allow me to measure emotion as well as sentiment. The VADER lexicon was developed to measure sentiment in social media. I hypothesize that music reviews are similar in structure to product reviews on social media sites like Yelp and Facebook, and thus VADER might be well suited to this corpus.

NRC Lexicon

I use the NRC lexicon to investigate emotion within the corpus.

I attach the NRC values to the VOCAB table. Then, I compute average emotion, sentiment, and VAD values for each document in the corpus.
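The attach-and-average step can be sketched in pandas. The three-term lexicon slice and the column names (`doc`, `term`, `joy`, `sadness`, `valence`) are assumptions; the real NRC lexicon covers thousands of terms and all eight emotions plus sentiment and VAD scores.

```python
import pandas as pd

# Hypothetical slice of the NRC lexicon, indexed by term.
NRC = pd.DataFrame({
    'term': ['joy', 'grief', 'party'],
    'joy': [1, 0, 1],
    'sadness': [0, 1, 0],
    'valence': [0.9, 0.1, 0.8],
}).set_index('term')

# Long-format token table: one row per token occurrence per document.
TOKENS = pd.DataFrame({
    'doc': [1, 1, 1, 2, 2],
    'term': ['joy', 'party', 'grief', 'grief', 'grief'],
})

# Attach lexicon values to tokens, then average per document.
tagged = TOKENS.join(NRC, on='term')
doc_emotion = tagged.groupby('doc')[['joy', 'sadness', 'valence']].mean()
```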

The image above plots each article in valence-arousal-dominance space. Points are colored by genre. There seems to be a positive correlation between dominance and arousal, and between dominance and valence. There is a negative correlation between valence and arousal. This is made more explicit in the subplots below. The inverse relationship between valence and arousal suggests that negative words (low valence) are more intense (high arousal)--at least in this corpus. Words that are positive (high valence) and words that are energetic (high arousal) are likely to be dominant.

The stacked bar chart above displays a sample of 8 articles and their average NRC emotion values. Anticipation, Joy, and Trust seem to be the most prevalent emotions. There are likely a lot of anticipation-related words since music journalism often discusses forthcoming albums and shows.

The second stacked bar chart shows article types. The chart is sorted by polarity so that the article types at the top are more positive and the types at the bottom are more negative. Profiles and interviews are more positive than reviews, which are more positive than reports and essays.

VADER Analysis

The VADER sentiment intensity analyzer is applied to sentences.

I sample some sentences that have been tagged using the VADER SentimentIntensityAnalyzer. The neg, neu, and pos columns are the proportions of negative, neutral, and positive terms in the sentence, respectively. The compound column captures sentiment with a single, weighted and normalized statistic. The sample above shows that VADER does a pretty good job, but it's not perfect. It does well to mark sentence [s14019, 6, 1] as very negative, but it misclassifies sentence [s7308, 68, 1] as neutral.

I aggregate the VADER sentiment statistics for each article. Then I analyze sentiment scores across genre and article type.
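The aggregation can be sketched as a pair of groupbys. The sentence-level scores below are hypothetical stand-ins for the output of VADER's `polarity_scores`, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical sentence-level VADER compound scores, produced upstream
# by SentimentIntensityAnalyzer.polarity_scores on each sentence.
sentences = pd.DataFrame({
    'article': [1, 1, 2, 2, 2],
    'genre': ['country', 'country', 'soul', 'soul', 'soul'],
    'compound': [0.6, -0.2, 0.4, 0.5, 0.0],
})

# Mean compound score per article, then per genre.
by_article = sentences.groupby(['article', 'genre'],
                               as_index=False)['compound'].mean()
by_genre = by_article.groupby('genre')['compound'].mean()
```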

The results of VADER sentiment analysis are quite similar to the NRC analysis despite the differences in methodology. This may be due to aggregating the sentiments of multiple articles under a single article type; in this way, the extreme sentiments of any given article are smoothed over. The average sentiment appears neutral to slightly positive.